KAN and RinSCut: Lazy Linear Classifier and Rank-in-Score Threshold in Similarity-Based Text Categorization
Abstract
Two important research areas in statistical approaches for automated text categorization are similarity-based learning algorithms and associated thresholding strategies. The combination of these techniques significantly influences the overall performance of text categorization systems. After researching common techniques in both areas, we describe a lazy linear classifier known as the keyword association network (KAN) and a rank-in-score (RinSCut) thresholding strategy to improve the categorization performance over existing techniques. Extensive experiments have been conducted on the Reuters-21578 data set. The experimental results show that KAN outperforms two linear classifiers, Rocchio and Widrow-Hoff, and gives results competitive with k-NN. All implemented classifiers except Widrow-Hoff show performance improvements with RinSCut.
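As a rough illustration of what a thresholding strategy does, the sketch below shows two common strategies from the text categorization literature: rank-based thresholding (RCut) and score-based thresholding (SCut). It is not the RinSCut method itself, whose details are not given in this abstract. The function names, category labels, scores, and threshold values are all assumptions chosen for the example.

```python
from typing import Dict, List


def rcut(scores: Dict[str, float], t: int) -> List[str]:
    """Rank-based thresholding: assign the t highest-scoring categories."""
    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(scores, key=lambda c: (-scores[c], c))
    return ranked[:t]


def scut(scores: Dict[str, float], thresholds: Dict[str, float]) -> List[str]:
    """Score-based thresholding: assign each category whose score
    reaches that category's own threshold."""
    return sorted(c for c, s in scores.items() if s >= thresholds[c])


# Hypothetical similarity scores of one document against three categories.
scores = {"earn": 0.82, "acq": 0.40, "grain": 0.35}
# Hypothetical per-category thresholds (in practice tuned on held-out data).
thresholds = {"earn": 0.50, "acq": 0.45, "grain": 0.30}

rcut_result = rcut(scores, t=1)
scut_result = scut(scores, thresholds)
print(rcut_result, scut_result)
```

With these made-up numbers, the rank-based rule keeps only the single best category, while the score-based rule can assign several categories or none, which is why the choice of thresholding strategy interacts with the classifier's scores.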
Similar resources
Text Categorization with a Small Number of Labeled Training Examples
This thesis describes the investigation and development of supervised and semisupervised learning approaches to similarity-based text categorization systems. It uses a small number of manually labeled examples for training and still maintains effectiveness. The purpose of text categorization is to automatically assign arbitrary raw documents to predefined categories based on their contents. Tex...
A Comparative Study on Statistical Machine Learning Algorithms and Thresholding Strategies for Automatic Text Categorization
Two main research areas in statistical text categorization are similarity-based learning algorithms and associated thresholding strategies. The combination of these techniques significantly influences the overall performance of text categorization. After investigating two similarity-based classifiers (k-NN and Rocchio) and three common thresholding techniques (RCut, PCut, and SCut), we describe...
Arabic News Articles Classification Using Vectorized-Cosine Based on Seed Documents
Besides its own merits, text classification (TC) has become a cornerstone in many applications. The work presented here is part of, and a prerequisite for, a project we have undertaken to create a corpus for the Arabic text process. It is an attempt to create modules automatically that would help speed up the process of classification for any text categorization task. It also serves as a tool for...
Improving kNN Text Categorization by Removing Outliers from Training Set
We show that excluding outliers from the training data significantly improves the kNN classifier, which in this case performs about 10% better than the best known method, the centroid-based classifier. Outliers are the elements whose similarity to the centroid of the corresponding category is below a threshold.
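The outlier definition above can be sketched as a filtering step applied before kNN training. This is only a minimal sketch under stated assumptions: cosine similarity is assumed as the similarity measure, the centroid is taken as the arithmetic mean of a category's vectors, and the function names, toy vectors, and the threshold of 0.6 are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]
Example = Tuple[Vector, str]  # (document vector, category label)


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(vectors: List[Vector]) -> List[float]:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def remove_outliers(training: List[Example], threshold: float) -> List[Example]:
    """Keep only examples whose similarity to their own category's
    centroid is at least `threshold`."""
    by_category: Dict[str, List[Vector]] = {}
    for vector, category in training:
        by_category.setdefault(category, []).append(vector)
    centroids = {c: centroid(vs) for c, vs in by_category.items()}
    return [
        (vector, category)
        for vector, category in training
        if cosine(vector, centroids[category]) >= threshold
    ]


# Toy two-dimensional training set (made-up values).
training = [
    ([1.0, 0.0], "a"),
    ([0.9, 0.1], "a"),
    ([0.0, 1.0], "a"),  # points away from the other "a" documents
    ([0.0, 1.0], "b"),
    ([0.1, 0.9], "b"),
]
kept = remove_outliers(training, threshold=0.6)
print(len(kept))
```

The filtered list would then be passed to the kNN classifier in place of the full training set; how the threshold is chosen is not stated in the excerpt.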
A New Inductive Learning Method for Multilabel Text Categorization
In this paper, we present a new inductive learning method for multilabel text categorization. The proposed method uses a mutual information measure to select terms and constructs document descriptor vectors for each category based on these terms. These document descriptor vectors form a document descriptor matrix. It also uses the document descriptor vectors to construct a document-similarity m...